Exploratory Data Analysis¶
The main objectives for this notebook are:
- Explore the clean dataset by performing univariate analysis
- Investigate the relationships between the features and the target by performing bivariate and multivariate analyses
- Extract relevant insights to share with business stakeholders
- Understand steps that will be required for ML pre-processing
Notes¶
- Using the Polars framework instead of pandas
- Using interactive plots (e.g. Plotly) for visualisations
- Writing clear insights after every section of the analysis
- Using well-written and documented utility functions
Imports¶
In [1]:
%load_ext autoreload
%autoreload 2
In [2]:
import os
import sys

import plotly.express as px
import plotly.io as pio
import polars as pl

# The utils folder sits next to this notebook's folder, so its path has to be
# added manually before the shared modules can be imported
path2add = os.path.normpath(os.path.abspath(os.path.join(os.getcwd(), os.pardir, "utils")))
if path2add not in sys.path:
    sys.path.append(path2add)

from visualisations import bar_plot, proportion_plot, boxplot_by_bin_with_target
# etc

pio.renderers.default = "notebook"
In [3]:
import pandas as pd
data = pd.read_parquet("../data/supervised_clean_data.parquet")
In [4]:
data.head()
Out[4]:
| | _id | | inter_api_access_duration(sec) | api_access_uniqueness | sequence_length(count) | vsession_duration(min) | ip_type | num_sessions | num_users | num_unique_apis | source | classification | is_anomaly |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 1f2c32d8-2d6e-3b68-bc46-789469f2b71e | 0.000812 | 0.004066 | 85.643243 | 5405 | default | 1460.0 | 1295.0 | 451.0 | E | normal | False |
| 1 | 1 | 4c486414-d4f5-33f6-b485-24a8ed2925e8 | 0.000063 | 0.002211 | 16.166805 | 519 | default | 9299.0 | 8447.0 | 302.0 | E | normal | False |
| 2 | 2 | 7e5838fc-bce1-371f-a3ac-d8a0b2a05d9a | 0.004481 | 0.015324 | 99.573276 | 6211 | default | 255.0 | 232.0 | 354.0 | E | normal | False |
| 3 | 3 | 82661ecd-d87f-3dff-855e-378f7cb6d912 | 0.017837 | 0.014974 | 69.792793 | 8292 | default | 195.0 | 111.0 | 116.0 | E | normal | False |
| 4 | 4 | d62d56ea-775e-328c-8b08-db7ad7f834e5 | 0.000797 | 0.006056 | 14.952756 | 182 | default | 272.0 | 254.0 | 23.0 | E | normal | False |
In [5]:
data.shape
Out[5]:
(1695, 13)
In [6]:
data.info
Out[6]:
<bound method DataFrame.info of _id \
0 0 1f2c32d8-2d6e-3b68-bc46-789469f2b71e
1 1 4c486414-d4f5-33f6-b485-24a8ed2925e8
2 2 7e5838fc-bce1-371f-a3ac-d8a0b2a05d9a
3 3 82661ecd-d87f-3dff-855e-378f7cb6d912
4 4 d62d56ea-775e-328c-8b08-db7ad7f834e5
... ... ...
1690 1694 3653d165-4b93-346b-9543-f1d4f5bf4831
1691 1695 44356d09-52e9-321e-9ec1-630e582bfe53
1692 1696 0ecdc692-df55-3990-815e-a30f1ee63f5f
1693 1697 468a84b3-2885-30d6-b1a8-6cf2e44577cd
1694 1698 2854b436-7d8b-3f2c-8139-3340ad2cd45a
inter_api_access_duration(sec) api_access_uniqueness \
0 0.000812 0.004066
1 0.000063 0.002211
2 0.004481 0.015324
3 0.017837 0.014974
4 0.000797 0.006056
... ... ...
1690 45.603433 0.800000
1691 852.929250 0.500000
1692 59.243000 0.800000
1693 0.754000 0.666667
1694 66.934857 0.428571
sequence_length(count) vsession_duration(min) ip_type \
0 85.643243 5405 default
1 16.166805 519 default
2 99.573276 6211 default
3 69.792793 8292 default
4 14.952756 182 default
... ... ... ...
1690 15.000000 41044 datacenter
1691 2.000000 102352 datacenter
1692 5.000000 17773 datacenter
1693 3.000000 136 datacenter
1694 7.000000 28113 datacenter
num_sessions num_users num_unique_apis source classification \
0 1460.0 1295.0 451.0 E normal
1 9299.0 8447.0 302.0 E normal
2 255.0 232.0 354.0 E normal
3 195.0 111.0 116.0 E normal
4 272.0 254.0 23.0 E normal
... ... ... ... ... ...
1690 2.0 1.0 12.0 F outlier
1691 2.0 1.0 1.0 F outlier
1692 3.0 1.0 4.0 F outlier
1693 2.0 1.0 2.0 F outlier
1694 3.0 1.0 3.0 F outlier
is_anomaly
0 False
1 False
2 False
3 False
4 False
... ...
1690 True
1691 True
1692 True
1693 True
1694 True
[1695 rows x 13 columns]>
In [7]:
data.isna().sum()
Out[7]:
0
_id                               0
inter_api_access_duration(sec)    0
api_access_uniqueness             0
sequence_length(count)            0
vsession_duration(min)            0
ip_type                           0
num_sessions                      0
num_users                         0
num_unique_apis                   0
source                            0
classification                    0
is_anomaly                        0
dtype: int64
In [8]:
data = pl.read_parquet("../data/supervised_clean_data.parquet")
print(data.shape)
data.head()
(1695, 13)
Out[8]:
shape: (5, 13)
| _id | | inter_api_access_duration(sec) | api_access_uniqueness | sequence_length(count) | vsession_duration(min) | ip_type | num_sessions | num_users | num_unique_apis | source | classification | is_anomaly |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| i64 | str | f64 | f64 | f64 | i64 | str | f64 | f64 | f64 | str | str | bool |
| 0 | "1f2c32d8-2d6e-3b68-bc46-789469… | 0.000812 | 0.004066 | 85.643243 | 5405 | "default" | 1460.0 | 1295.0 | 451.0 | "E" | "normal" | false |
| 1 | "4c486414-d4f5-33f6-b485-24a8ed… | 0.000063 | 0.002211 | 16.166805 | 519 | "default" | 9299.0 | 8447.0 | 302.0 | "E" | "normal" | false |
| 2 | "7e5838fc-bce1-371f-a3ac-d8a0b2… | 0.004481 | 0.015324 | 99.573276 | 6211 | "default" | 255.0 | 232.0 | 354.0 | "E" | "normal" | false |
| 3 | "82661ecd-d87f-3dff-855e-378f7c… | 0.017837 | 0.014974 | 69.792793 | 8292 | "default" | 195.0 | 111.0 | 116.0 | "E" | "normal" | false |
| 4 | "d62d56ea-775e-328c-8b08-db7ad7… | 0.000797 | 0.006056 | 14.952756 | 182 | "default" | 272.0 | 254.0 | 23.0 | "E" | "normal" | false |
In [9]:
data['ip_type'].unique()
Out[9]:
shape: (2,)
| ip_type |
|---|
| str |
| "default" |
| "datacenter" |
Univariate Analysis¶
This section goes through the available columns and plots them to inspect distributions, outliers, etc. This is done to introduce the dataset and get familiar with it.
In [10]:
bar_plot(data, "ip_type", "IP Type Counts")
Observations:
- There are just two IP types, `default` and `datacenter`, with `default` being the more frequent one
Features vs Target¶
This section performs a bivariate analysis by comparing the distributions of normal vs. outlier traffic. This can help determine which data and feature selection steps to perform.
In [11]:
proportion_plot(data, "ip_type", "is_anomaly", "Behaviour Type by Source")
Observations:
- If the activity comes from a `datacenter` IP, it is guaranteed to be an outlier

Impact
- The dataset needs to be filtered to include only `default` traffic, since we don't need a model to classify `datacenter` traffic
Hypotheses¶
Are longer sessions with high-speed inter-API calls more anomalous?¶
It is usually the case that a lot of events happening in a short period of time signals bot or other malicious activity. Let's see if that holds for this dataset.
In [12]:
boxplot_by_bin_with_target(
    data=data,
    column_to_bin="sequence_length(count)",
    numeric_column="inter_api_access_duration(sec)",
    target="is_anomaly",
)
Observations
- Outliers have a faster inter-API access duration than normal traffic
Insights
- Longer sequences with a faster inter-API access duration are not more likely to be anomalous
Summary¶
Main Insights¶
- Most of the traffic comes from the `default` source; only 9% comes from datacenters
- All datacenter traffic is considered anomalous
- Longer sequences with faster inter-API access durations are not more likely to be anomalous
Implications for Modelling¶
- The dataset needs to be filtered to include only the `default` source type